class: center, middle, inverse, title-slide

# Lecture 9
## Multiple Groups: Additional models
### Psych 10 C
### University of California, Irvine
### 04/15/2022

---

## Final Project

- As you already know, we will have a final project for this class instead of a final exam.

--

- This final project is worth 50% of your total grade, and it should be done in groups of 3 to 4 students.

--

- The objective is to have you write a short version of a scientific paper, including the data analysis section.

--

- To make things "easier" for you, we have selected 3 problems with their corresponding data sets that you can choose from for your final project.

--

- Each problem requires a different approach to answer its research question; not all problems can be solved the same way. Therefore, before we look at the data, we want you to choose one of the three problems.

---

### Problem 1:

- We want to know if people who have had a surgical brain procedure known as split-brain surgery have problems identifying objects presented on either side of their visual field.

--

### Problem 2:

- We want to know if the cognitive decline of a person from one year to the next is associated with their level of social interaction, their physical activity, and whether they used the app Luminosity during that year.

--

### Problem 3:

- We want to know how the asking price for an IKEA chair changes as a function of the total cost of the materials, whether the person built the chair themselves, and the difficulty of building it.

---

## Final project

The paper must have the following points:

- Introduction: 4 to 5 sentences that summarize the problem and why it's important. The idea is to convince your reader that what you are doing matters; you can use references to previous work (1 or 2 max).

--

- Methods: In one paragraph, explain what is in the data set and where it came from (how it was collected).
--

- Data: Provide a summary of your data; you can use summary statistics like the median, mean, or variance. You can also use this section to highlight some properties of the data that you are working with by using graphs.

---

## Final project

- Model comparison: Describe the models that you will be comparing using your data and their main assumptions (e.g., does the model assume that the groups are equal?).

- You should provide a summary of your analysis; a good way to do so is a table with the SSE of each model that you tested and the BIC associated with that model.

--

- Discussion:
  - One paragraph on: What conclusions can you draw from the model comparisons about the experimental question?

--

  - One paragraph on: What might be the broader implications of this finding?

--

  - One paragraph on: What are the limitations of this study and/or what gaps are left by this study? Be sure to explain why each limitation could impact the results.

---

class: inverse, center, middle

# Multiple Groups

---

## Comparison between multiple groups

- Last class we started looking at an example where we had the anxiety levels of 3 cohorts of students at a university: the 2018, 2019, and 2020 cohorts.

--

- We formalized a model that assumes there are no differences in the anxiety levels of First year students across the 3 cohorts (the Null model):

`$$y_{ij}\sim\text{Normal}(\mu,\sigma_0^2)$$`

for `\(i=1, \dots,30\)` students in each of the `\(j=1,2,3\)` cohorts (2018 to 2020).

--

- The second model assumes that the anxiety levels of First year students in at least one of the three cohorts are different from the rest. Formally, this model is expressed as:

`$$y_{ij}\sim\text{Normal}(\mu_j,\sigma_e^2)$$`

--

- We also mentioned that this model does not fully address our original question, as the only conclusion we can draw from it is that at least one of the groups is different from the rest.
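---

## BIC: a reusable sketch

The BIC values on these slides all come from the same formula, `\(n\log(\text{MSE}) + k\log(n)\)`, where `\(k\)` is the number of means the model estimates. As a minimal sketch (the helper name `bic` is ours, not part of the class code), the formula can be wrapped in a function instead of being retyped for every model:

```r
# BIC from the sum of squared errors:
# n * log(SSE / n) + k * log(n), with k = number of estimated means.
bic <- function(sse, n, k) {
  n * log(sse / n) + k * log(n)
}

# Check against the values reported on the following slides
# (Null model: SSE = 598.46, n = 90, k = 1; Effects model: SSE = 407.5, k = 3).
bic(598.46, 90, 1)  # approximately 175.01
bic(407.50, 90, 3)  # approximately 149.42
```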
---

## Results: review

- The estimates for the Null model are:

.pull-left[

```r
n_total <- nrow(anxiety)

anxiety <- anxiety %>%
  mutate("null_pred" = mean(anxiety),
         "null_error" = (anxiety - null_pred)^2)

sse_0 <- sum(anxiety$null_error)
mse_0 <- 1/n_total * sse_0
bic_0 <- n_total * log(mse_0) + log(n_total)
```
]

.pull-right[
Prediction: 10.08

SSE: 598.46

Mean SE: 6.65
]

---

## Results: review

- The estimates from the Effects model are:

.pull-left[

```r
mean_groups <- anxiety %>%
  group_by(cohort) %>%
  summarise("pred" = mean(anxiety))

anxiety <- anxiety %>%
  mutate("eff_pred" = case_when(cohort == "2018" ~ mean_groups$pred[1],
                                cohort == "2019" ~ mean_groups$pred[2],
                                cohort == "2020" ~ mean_groups$pred[3]),
         "eff_error" = (anxiety - eff_pred)^2)

sse_e <- sum(anxiety$eff_error)
mse_e <- 1/n_total * sse_e
r2_e <- (sse_0 - sse_e) / sse_0
bic_e <- n_total * log(mse_e) + 3 * log(n_total)
```
]

.pull-right[
Prediction:
- 2018: 9.17
- 2019: 8.93
- 2020: 12.13

SSE: 407.5

Mean SE: 4.53
]

---

## Results: review

- The conclusions that we drew from those estimates were:

--

1. The Effects model accounted for 31.9% of the total variation in the anxiety scores of the students in the 3 different cohorts.

--

2. According to the BIC values, we should select the Effects model (149.42) over the Null model (175.01). This means that at least one of the 3 groups is different from the others.

--

- However, our original problem asked whether there were any differences between the cohorts, not whether there was at least one cohort that was different.

--

- To solve this problem we have to look at three additional models.

---

## Group models

- From the comparison between the Null and Effects models we know that there is at least one cohort of students that has a different anxiety level compared to the others.

--

- In order to find out which group is different, we can build three new models, each assuming that a single cohort is different from the rest.
--

- Notice that making the comparison between the Null and Effects models first will save us a lot of work if it turns out that the best model is the Null.

---

## Three models

- We need 3 models that formalize 3 assumptions:

--

1. Students in the 2018 cohort are different from students in the 2019 and 2020 cohorts.

--

2. Students in the 2019 cohort are different from students in the 2018 and 2020 cohorts.

--

3. Students in the 2020 cohort are different from students in the 2018 and 2019 cohorts.

--

- To make this easy to remember, we will refer to each model using the year of the cohort that is **different** from the rest (e.g., the "2018 model" is the one that assumes that students in the 2018 cohort are different from students in the other 2).

---

## 2018 model

- We can formalize our "2018 model" using the following notation:

`$$y_{i1} \sim \text{Normal}(\mu_1,\sigma_1^2)$$`
`$$y_{i2},y_{i3} \sim \text{Normal}(\mu,\sigma_1^2)$$`

--

- Notice that only the expected value of the anxiety level of students in the 2018 cohort `\((\mu_1)\)` is different from the other two `\((\mu)\)`.

--

- Our estimators for `\(\mu_1\)`, `\(\mu\)`, and `\(\sigma_1^2\)` will be similar to the ones we had before.

--

- The estimator `\(\hat{\mu}_1\)` will be the average anxiety level in the 2018 cohort, while the estimator for the other 2 cohorts, `\(\hat{\mu}\)`, will be the average anxiety level of all students in the 2019 and 2020 cohorts.

---

## 2018 model: Predictions

- An easy way to calculate and add the predictions of this new model to our data is to first create an "indicator" variable. This is a variable that takes one value if the observation belongs to the 2018 cohort and another value if it belongs to one of the others.

```r
anxiety <- anxiety %>%
  mutate("id_2018" = ifelse(test = cohort == "2018",
                            yes = "2018",
                            no = "other"))
```

--

- Then we can use that variable to group our data and get the averages that we need.
--

```r
pred_2018 <- anxiety %>%
  group_by(id_2018) %>%
  summarise("pred" = mean(anxiety))
```

--

- The predicted anxiety level for First year students in the 2018 cohort was 9.17, and for students in the 2019 and 2020 cohorts it was 10.53.

---

## 2018 model: Error

- Using the predicted anxiety levels we can calculate the error of the model. Again, we start by calculating the error for each observation by taking the difference between the observation and the model prediction ( `\(\hat{\mu}_1\)` or `\(\hat{\mu}\)`, respectively) and then squaring that difference.

```r
anxiety <- anxiety %>%
  mutate("prediction_2018" = ifelse(test = id_2018 == "2018",
                                    yes = pred_2018$pred[1],
                                    no = pred_2018$pred[2]))

anxiety <- anxiety %>%
  mutate("error_2018" = (anxiety - prediction_2018)^2)
```

--

- Once again, we can get the SSE for the model by adding all of the values in the `error_2018` column. We can also use those values to calculate the proportion of variance accounted for and the BIC for the model.

```r
sse_2018 <- sum(anxiety$error_2018)
mse_2018 <- 1/n_total * sse_2018
r2_2018 <- (sse_0 - sse_2018)/sse_0
bic_2018 <- n_total * log(mse_2018) + 2 * log(n_total)
```

---

## 2019 and 2020 models

- Similar to our 2018 model, the other two models assume that one of the cohorts is different from the rest, so we can write the 2019 model as:

`$$y_{i2} \sim \text{Normal}(\mu_2,\sigma_2^2)$$`
`$$y_{i1},y_{i3} \sim \text{Normal}(\mu,\sigma_2^2)$$`

--

- Where `\(y_{i2}\)` represents the anxiety level of First year students in the 2019 cohort.

--

- Again, our estimator `\(\hat{\mu}_2\)` of `\(\mu_2\)` will be the average of the anxiety levels of students in the 2019 cohort, while the prediction for students in the other 2 cohorts will be the average of the anxiety levels of all students in 2018 and 2020.
--

- For 2020 we can write the model as:

`$$y_{i3} \sim \text{Normal}(\mu_3,\sigma_3^2)$$`
`$$y_{i1},y_{i2} \sim \text{Normal}(\mu,\sigma_3^2)$$`

---

## Multiple groups

- Notice that in all of these models we always assign one distribution to one of the cohorts and then a single distribution to the remaining ones.

--

- Therefore, all of these models are simpler than the general Effects model, which uses 3 normal distributions instead of 2.

---

## Inference: 2019 and 2020 models

- First we can calculate and add the predictions for the 2019 model. Remember that we can make those calculations easier by creating an indicator variable that assigns one value to our observations from the 2019 cohort and a different value to the rest.

--

```r
# add indicator variable to data file.
anxiety <- anxiety %>%
  mutate("id_2019" = ifelse(test = cohort == "2019",
                            yes = "2019",
                            no = "other"))

# obtain the estimates of mu_2 and mu for the 2019 model.
pred_2019 <- anxiety %>%
  group_by(id_2019) %>%
  summarise("pred" = mean(anxiety))

# add the predictions (estimates) as a new variable to make calculation of the
# sse and mse easier.
anxiety <- anxiety %>%
  mutate("pred_2019" = ifelse(test = id_2019 == "2019",
                              yes = pred_2019$pred[1],
                              no = pred_2019$pred[2]))
```

---

## Inference: 2019 and 2020 models

- Then we can get the SSE, the Mean SE `\((\hat{\sigma}_2^2)\)`, the proportion of error accounted for by the 2019 model, and its BIC.
```r
# add variable with the errors of each observation for the 2019 model
anxiety <- anxiety %>%
  mutate("error_2019" = (anxiety - pred_2019)^2)

# calculate SSE for the 2019 model
sse_2019 <- sum(anxiety$error_2019)

# calculate Mean SE for the 2019 model
mse_2019 <- 1/n_total * sse_2019

# calculate proportion of error accounted for by the 2019 model
r2_2019 <- (sse_0 - sse_2019) / sse_0

# calculate BIC for the 2019 model
bic_2019 <- n_total * log(mse_2019) + 2 * log(n_total)
```

---

## Inference: 2019 and 2020 models

- If we do all the same calculations for the 2020 model, we will be able to compare all of our models and reach a conclusion about our original problem.

.panelset[
.panel[.panel-name[Predictions]

```r
# add indicator variable to data file.
anxiety <- anxiety %>%
  mutate("id_2020" = ifelse(test = cohort == "2020",
                            yes = "2020",
                            no = "other"))

# obtain the estimates of mu_3 and mu for the 2020 model.
pred_2020 <- anxiety %>%
  group_by(id_2020) %>%
  summarise("pred" = mean(anxiety))

# add the predictions (estimates) as a new variable to make calculation of the
# sse and mse easier.
anxiety <- anxiety %>%
  mutate("pred_2020" = ifelse(test = id_2020 == "2020",
                              yes = pred_2020$pred[1],
                              no = pred_2020$pred[2]))
```
]

.panel[.panel-name[Error and Evaluation]

```r
# add variable with the errors of each observation for the 2020 model
anxiety <- anxiety %>%
  mutate("error_2020" = (anxiety - pred_2020)^2)

# calculate SSE for the 2020 model
sse_2020 <- sum(anxiety$error_2020)

# calculate Mean SE for the 2020 model
mse_2020 <- 1/n_total * sse_2020

# calculate proportion of error accounted for by the 2020 model
r2_2020 <- (sse_0 - sse_2020) / sse_0

# calculate BIC for the 2020 model
bic_2020 <- n_total * log(mse_2020) + 2 * log(n_total)
```
]
]

---

## Interpretation and conclusion:

- This time we have a total of 5 models that we want to compare.
--

- Our interpretation and conclusions have to be drawn from the results of that comparison, so we want to summarize the results in a way that makes them easy to communicate.

--

- To present the results of multiple comparisons between models we often use a table. This table should include the Mean Squared Error (our estimator of each `\(\sigma^2\)`), the proportion of variance accounted for by each model, and the BIC.

--

- For our anxiety levels example we can report the following table:

---

### Results

| Model   | Number of Parameters | Mean-SE | `\(R^2\)` | BIC    |
|---------|:--------------------:|:-------:|:---------:|:------:|
| Null    | 1                    | 6.65    | NA        | 175.01 |
| 2018    | 2                    | 6.23    | 0.062     | 173.71 |
| 2019    | 2                    | 5.99    | 0.098     | 170.18 |
| 2020    | 2                    | 4.54    | 0.318     | 145.10 |
| Effects | 3                    | 4.53    | 0.319     | 149.42 |

--

- The first thing to notice is that all the models that look at the effect of a single year have two parameters, which means that they assume there are only two populations (Normal distributions).

--

- Second, the Mean SE is always lower for models with more parameters.

--

- Finally, of all the models, the 2020 model has the lowest BIC, even though its Mean SE is higher than that of the Effects model. This is because the added flexibility of the Effects model is "punished" by the BIC.

---

## Discussion

- Now we can draw a conclusion based on the results of our model comparison.

--

- From the BIC values obtained for each model we could conclude that:

--

- Based on the BIC values obtained from the models, the one that best accounts for our observations is the model that assumes that the anxiety levels of First year students in the 2020 cohort are different from the levels of students in 2018 and 2019.
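---

### Results: building the table in R

A summary table like the one above can also be assembled programmatically. The sketch below plugs in the reported values directly so it is self-contained; in the actual analysis you would use the objects computed earlier (`mse_0`, `r2_2018`, `bic_2020`, etc.):

```r
# Summary of the five models, using the values from the results table.
results <- data.frame(
  model   = c("Null", "2018", "2019", "2020", "Effects"),
  k       = c(1, 2, 2, 2, 3),
  mean_se = c(6.65, 6.23, 5.99, 4.54, 4.53),
  bic     = c(175.01, 173.71, 170.18, 145.10, 149.42)
)

# The preferred model is the one with the lowest BIC.
results$model[which.min(results$bic)]  # "2020"
```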
---

## Discussion

- Based on the estimates of `\(\mu_3\)` and `\(\mu\)` from the 2020 model we can say that:

--

- According to our model, the average anxiety level of First year students in the 2020 cohort was approximately 12.13, while the average for students in the 2018 and 2019 cohorts was 9.05. This suggests that anxiety levels in the 2020 cohort were higher.

---

## Discussion

- Finally, based on the proportion of error accounted for by the model, we could add:

--

- The model that assumes that only the 2020 cohort was different from the others accounts for 32% of the total variation in the anxiety levels of First year students.
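---

## Discussion: checking the numbers

As a quick sanity check, the 32% figure can be recovered from the Null model's SSE (598.46) and the 2020 model's Mean SE (4.54) with `\(n = 90\)`. This sketch uses the rounded values from the results table, so it differs slightly from the exact `r2_2020` computed earlier:

```r
sse_0    <- 598.46          # Null model SSE
n_total  <- 90              # 30 students in each of 3 cohorts
sse_2020 <- 4.54 * n_total  # SSE recovered from the reported Mean SE

r2_2020 <- (sse_0 - sse_2020) / sse_0
round(r2_2020, 2)  # approximately 0.32
```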